Amanda Birmingham, CCBB, UCSD
Currently, the fitness-scoring pipeline for dual CRISPR screens requires a manual step in which a researcher examines a histogram of the count distribution for each sample+timepoint and, from this graphical representation, selects an abundance threshold below which constructs will not be included (in order to remove noisy data that can swamp the detection of real fitness effects).
The manual nature of this step prevents full automation of the analysis pipeline, and the fact that different researchers may make different abundance threshold choices introduces a point of variation in the analysis. I therefore seek to develop a deterministic heuristic to repeatably and automatically identify an acceptable abundance threshold value in arbitrary dual CRISPR screen data.
In the dual CRISPR manuscript, this description of the abundance-threshold issue is provided:
"To avoid fitting Eq. 3 to spurious data, we use only data points above a certain threshold (Supplementary Fig. 4). The threshold depends mainly on the size of the sample (number of cells) collected at a given time in relation to the size of the viral library and on the depth of sequencing. We note that the left-most peak in the histograms of (Supplementary Fig. 4) contains severely undersampled constructs with zero counts. Their x-coordinate corresponds to a pseudo-count of one introduced only for visualization purposes. It is arbitrary and therefore should not be used for fitting the model. Likewise, finite but very low counts are considered missing data. We set a threshold for every time point (red lines in Supplementary Fig. S4)."
According to the above-quoted manuscript text, the threshold depends on:
- the size of the sample (number of cells) collected at a given timepoint,
- the size of the viral library, and
- the depth of sequencing.
The existing code that preps the count data and creates the abundance histograms from which threshold choices are inferred is shown below:
#preliminary preparations of the input data frame
data<-data.matrix(X[,6:(5+2*nt)])
good<-(X$geneA != X$geneB) #reject any constructs with two 0's
goodX<-X[good,] #the 0-0 constructs are gone
nn<-sum(good) #this many constructs
cpA<-as.character(goodX$probeA)
ix<-grep("NonTargeting",cpA)
cpA[ix]<-paste("0",cpA[ix],sep="") #this puts NonTargeting probes at the beginning of alphabetically sorted order
cpB<-as.character(goodX$probeB)
ix<-grep("NonTargeting",cpB)
cpB[ix]<-paste("0",cpB[ix],sep="")
pswitch<-cpA>cpB #need to switch?
phold<-cpA[pswitch]
cpA[pswitch]<-cpB[pswitch]
cpB[pswitch]<-phold #cpA and cpB are always in alphabetical order, cpA < cpB
probes<-sort(unique(c(cpA,cpB))) #entire probe set in alphabetical order
nprobes<-length(probes)
cgA<-as.character(goodX$geneA)
cgB<-as.character(goodX$geneB)
genes<-sort(unique(cgA)) #should be 74 "genes"
n<-length(genes) # n = 74 if doing it by genes or 222 if doing it by probe
mm<-n*(n-1)/2
gswitch<-cgA>cgB #need to switch?
ghold<-cgA[gswitch]
cgA[gswitch]<-cgB[gswitch]
cgB[gswitch]<-ghold
gA_gB<-paste(cgA,cgB,sep="_")
pA_pB<-paste(cpA,cpB,sep="_")
goodX<-data.frame(goodX,cgA,cgB,gA_gB) #now gA_gB is ordered so that gA < gB
gooddata<-data.matrix(goodX[,6:(5+2*nt)])
gooddata[gooddata==0]<-1 #pseudocounts
abundance<-apply(gooddata,2,sum) # 2 means sum each column
y<-t(log2(t(gooddata)/abundance)) #log2 frequencies
ab_y<-gooddata
In making the plot, the only variable (beyond the per-construct counts themselves) that is used is "abundance"--that is, the sum of all the construct counts for a given sample. It is fair to say that this overall abundance is influenced by all three of the factors listed above, although the nature of their influence is a black box to us by the time we receive the abundance.
options(jupyter.plot_mimetypes = c("text/plain", "image/png" ))
options(repr.plot.width=7, repr.plot.height=7)
options(digits=10)
I note that, for the purposes of threshold selection, every sample has its frequency calculated against its own abundance. Within a given sample, the abundance is a constant, so dividing by it just shifts the log2 values by a constant amount: logb(m/n) = logb(m) - logb(n), so log2(t(gooddata)/abundance) = log2(t(gooddata)) - log2(abundance). Since the thresholds are picked for each sample individually, rather than with information taken from across multiple samples, I believe we can safely leave the per-sample abundance constant out of the threshold-picking process and instead look directly at the log2(raw count) values. The counts files emitted by the counting pipeline include raw counts, which is what the heuristic will need to take as input.
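A quick sanity check of this constant-shift claim (the toy counts below are made up purely for illustration):

```r
# Toy demonstration: normalizing by a per-sample abundance only shifts the
# log2 values by a constant, so the shape of the distribution (and hence a
# threshold's position on it) is unchanged.
counts <- c(1, 8, 64, 512)           # hypothetical raw counts for one sample
abundance <- sum(counts)             # per-sample abundance, a constant
log2Freqs <- log2(counts/abundance)
shiftedLog2Counts <- log2(counts) - log2(abundance)
all.equal(log2Freqs, shiftedLog2Counts)  # TRUE
```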
Let us examine a distribution of count values (on a log2 scale) for some real data that is high-quality and well-behaved (Hela.CV4R4R_d3_1_S7_L001_001), using a slimmed-down and slightly modified version of the above plotting algorithm:
goodCounts = read.table("~/Desktop/vgooddata.csv", sep=",", header=TRUE)
head(goodCounts, 10)
gManualColor = "black"
gMinimumsColor = "orange"
gMaximumsColor = "turquoise"
gHistogramColor = "blue"
gDensityColor = "lightgreen"
gSplineColor = "lightpink"
gChosenColor = "red"
gSpar = NULL
drawCountDists1<-function(countsDf, manualThresholds=NULL){
    countsDf[countsDf==0]<-1 #pseudocounts
    rge<-range(log2(countsDf))
    xPositions = seq(from = 0, to = ceiling(rge[2])+1, by = 1)
    xLabels = 2^(xPositions)
    for (i in 1:(length(colnames(countsDf)))) {
        log2CurrCounts<-log2(countsDf[,i])
        log2CurrCountsHist<-hist(log2CurrCounts,
            breaks=seq(0-0.05,rge[2]+0.05,by=0.05),
            main=colnames(countsDf)[i],
            col=gHistogramColor,
            border=FALSE,
            xaxt = 'n',
            xlab="")
        log2CurrCountsDensity<-density(log2CurrCounts, bw=0.2)
        scaleFactor = sum(log2CurrCountsHist$counts)*0.05
        lines(log2CurrCountsDensity$x, log2CurrCountsDensity$y*scaleFactor, col=gDensityColor)
        axis(side = 1, at = xPositions, labels=xLabels, las=2)
        mtext("counts (pseudocount added to zeros only)", side=1, line=3)
        if (!is.null(manualThresholds)){
            if (length(manualThresholds)>=i){
                log2threshold = log2(manualThresholds[i])
                rug(log2threshold, col=gManualColor, lwd=3)
            }
        }
    }
}
drawCountDists1(goodCounts)
In his methodII.pdf provided on 20160708, Roman showed his selected threshold for this dataset:

I note that the y scales are different, apparently because my dataset has more zeroes; I have not verified this hypothesis, but it is probably because I am visualizing the unmodified counts files while Roman's code removes all the counts for constructs with two non-targeting probes before visualization. Also, I do not have the actual numerical value of the threshold; this is perhaps less problematic than one might think given that (as noted above) I want to work from the raw counts (visualized on a log scale) rather than the log2 of the abundance-normalized counts (visualized on a linear scale). I therefore aim to approximate the position of the abundance threshold line on the distribution rather than to replicate the abundance threshold number, given the difference in units. I manually estimate that the threshold on my graph which falls at approximately the same position as the threshold on Roman's graph is 12:
goodCountsManualThresholds = c(12)
drawCountDists1(goodCounts, goodCountsManualThresholds)
The U2OS Cas9Trex2 data had very low and somewhat malformed count distributions across all 8 sample+timepoints, as shown below:
poorCounts = read.table("~/Desktop/ab_y.csv", sep=",", header=TRUE)
head(poorCounts, 10)
For these samples, I selected the thresholds and Roman approved them. They were -18, -18, -16, -16, -18, -18, -16, and -16 on the log2 scale of the abundance-normalized counts for these samples, which looked like this on the plots:




Below, I estimate the positions of these thresholds on the plots based on my preferred data and units:
poorCountsManualThresholds = c(32, 42, 20, 19, 55, 32, 23, 21)
drawCountDists1(poorCounts, poorCountsManualThresholds)
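For reference, a threshold t on the log2(abundance-normalized) scale converts back to raw counts by multiplying 2^t by the sample's abundance. A sketch, using a hypothetical abundance value (real values are the per-sample column sums of the counts matrix):

```r
# Hypothetical conversion from a log2(frequency) threshold back to a
# raw-count threshold for one sample. The abundance value here is invented
# for illustration only.
countThresholdFromLog2Freq <- function(log2FreqThreshold, abundance) {
    2^log2FreqThreshold * abundance
}
countThresholdFromLog2Freq(-18, 2^23)  # 32, for a sample with ~8.4M total counts
```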
In discussions of how to select the thresholds shown above, I have been given two pieces of guidance: to look for where the histogram stops appearing quantized, and to select the apparent valley between the "noise" counts and the "main distribution". I think I have to discount the first piece of advice, as it is dependent on the details of the visualization and scale. The second does appear to be a useful guide. The issue, however, is that the density function isn't really capturing the shape of the (visible) histogram, which is what I as a manual selector use to make my decision.
I look around for different curve-fitting options and settle on a smoothing spline. After I take out all the points in the histogram where the frequency drops to zero, a smoothing spline of the remainder does a good job of capturing the "curve" I see in each histogram (shown in pink below, with the density curve still drawn in light green):
drawCountDists2<-function(countsDf, manualThresholds=NULL){
    countsDf[countsDf==0]<-1 #pseudocounts
    rge<-range(log2(countsDf))
    xPositions = seq(from = 0, to = ceiling(rge[2])+1, by = 1)
    xLabels = 2^(xPositions)
    for (i in 1:(length(colnames(countsDf)))) {
        log2CurrCounts<-log2(countsDf[,i])
        log2CurrCountsHist<-hist(log2CurrCounts,
            breaks=seq(0-0.05,rge[2]+0.05,by=0.05),
            main=colnames(countsDf)[i],
            col=gHistogramColor,
            border=FALSE,
            xaxt = 'n',
            xlab="")
        axis(side = 1, at = xPositions, labels=xLabels, las=2)
        mtext("counts (pseudocount added to zeros only)", side=1, line=3)
        # density curve
        log2CurrCountsDensity<-density(log2CurrCounts)
        scaleFactor = sum(log2CurrCountsHist$counts)*0.05
        lines(log2CurrCountsDensity$x, log2CurrCountsDensity$y*scaleFactor, col=gDensityColor)
        # smoothing spline curve of non-zero freqs only
        tempDf = data.frame(log2CurrCountsHist$mids, log2CurrCountsHist$count)
        colnames(tempDf) = c("x", "y")
        nonZeroLog2CurrCountsHist = tempDf[which(tempDf$y>0), ]
        ss = smooth.spline(nonZeroLog2CurrCountsHist$x, nonZeroLog2CurrCountsHist$y, spar=gSpar)
        lines(ss, col=gSplineColor)
        # rug plots of manual thresholds
        if (!is.null(manualThresholds)){
            if (length(manualThresholds)>=i){
                log2threshold = log2(manualThresholds[i])
                rug(log2threshold, col=gManualColor, lwd=3)
            }
        }
    }
}
drawCountDists2(goodCounts, goodCountsManualThresholds)
drawCountDists2(poorCounts, poorCountsManualThresholds)
Good! Now that I have a function that "sees" minima where my human reasoning does, I can return to the goal of identifying the "valley"--local minimum--between the noise and main distributions. I start by trying to identify the interior (i.e., not at either end) minima in the count distributions. The fitted spline object includes discrete x and y values, so there's no need to even invoke calculus; I follow an approach described on Stack Overflow (https://stackoverflow.com/questions/6836409/finding-local-maxima-and-minima/6836583#6836583 ):
findMinIndices<-function(objWithXandY){
    indicesOfMins = which(diff(sign(diff(objWithXandY$y)))==+2)+1
    return(indicesOfMins)
}
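To see how the diff-of-signs trick finds minima, a tiny worked example (the vector here is made up):

```r
# For y = (3, 1, 2, 0, 5):
#   diff(y)          = (-2,  1, -2,  5)
#   sign(diff(y))    = (-1,  1, -1,  1)
#   diff(sign(...))  = ( 2, -2,  2)   <- +2 marks a switch from falling to rising
# which(... == 2) gives positions 1 and 3; adding 1 recovers the indices of
# the local minima in y itself.
y <- c(3, 1, 2, 0, 5)
which(diff(sign(diff(y))) == 2) + 1  # 2 4
```

Indices 2 and 4 correspond to the local minima y = 1 and y = 0, as expected.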
plotMins<-function(objWithXandY, indicesOfMins){
    minimums = objWithXandY$x[indicesOfMins]
    rug(minimums, col=gMinimumsColor, lwd=2)
}
findAndPlotMins<-function(objWithXandY){
    indicesOfMins = findMinIndices(objWithXandY)
    plotMins(objWithXandY, indicesOfMins)
}
drawCountDists3<-function(countsDf, manualThresholds=NULL, extremaFunc=NULL){
    countsDf[countsDf==0]<-1 #pseudocounts
    rge<-range(log2(countsDf))
    xPositions = seq(from = 0, to = ceiling(rge[2])+1, by = 1)
    xLabels = 2^(xPositions)
    for (i in 1:(length(colnames(countsDf)))) {
        log2CurrCounts<-log2(countsDf[,i])
        log2CurrCountsHist<-hist(log2CurrCounts,
            breaks=seq(0-0.05,rge[2]+0.05,by=0.05),
            main=colnames(countsDf)[i],
            col=gHistogramColor,
            border=FALSE,
            xaxt = 'n',
            xlab="")
        axis(side = 1, at = xPositions, labels=xLabels, las=2)
        mtext("counts (pseudocount added to zeros only)", side=1, line=3)
        # density curve
        log2CurrCountsDensity<-density(log2CurrCounts)
        scaleFactor = sum(log2CurrCountsHist$counts)*0.05
        lines(log2CurrCountsDensity$x, log2CurrCountsDensity$y*scaleFactor, col=gDensityColor)
        # smoothing spline curve of non-zero freqs only
        tempDf = data.frame(log2CurrCountsHist$mids, log2CurrCountsHist$count)
        colnames(tempDf) = c("x", "y")
        nonZeroLog2CurrCountsHist = tempDf[which(tempDf$y>0), ]
        ss = smooth.spline(nonZeroLog2CurrCountsHist$x, nonZeroLog2CurrCountsHist$y, spar=gSpar)
        lines(ss, col=gSplineColor)
        # rug plots of manual thresholds
        if (!is.null(manualThresholds)){
            if (length(manualThresholds)>=i){
                log2threshold = log2(manualThresholds[i])
                rug(log2threshold, col=gManualColor, lwd=3)
            }
        }
        # rug plots determined by external function
        if (!is.null(extremaFunc)){
            extremaFunc(ss)
        }
    }
}
drawCountDists3(goodCounts, goodCountsManualThresholds, findAndPlotMins)
drawCountDists3(poorCounts, poorCountsManualThresholds, findAndPlotMins)
Hmm, well, that has potential but there are a few issues:
For issue 1 (there is more than one minimum, and the code will need to be able to pick the one on which to anchor), how about if I say I want the right-most minimum?
findAndPlotAllAndChosenMins<-function(objWithXandY){
    indicesOfMins = findMinIndices(objWithXandY)
    plotMins(objWithXandY, indicesOfMins)
    if (length(indicesOfMins) > 0) {
        #x values are sorted ascending, so the largest index is the right-most minimum
        rightmostIndex = max(indicesOfMins)
        log2CountsAtChosenMin = objWithXandY$x[rightmostIndex]
        plotExplicitThresholds(c(log2CountsAtChosenMin), areLog2=TRUE, color=gChosenColor)
    } else {
        print("No minimum chosen")
    }
}
plotExplicitThresholds<-function(thresholdsList, thresholdIndex=1, areLog2=FALSE, color="black"){
    if (!is.null(thresholdsList)){
        if (length(thresholdsList)>=thresholdIndex){
            currThreshold = thresholdsList[thresholdIndex]
            log2threshold = if(areLog2) currThreshold else log2(currThreshold)
            rug(log2threshold, col=color, lwd=3)
        }
    }
}
drawCountsDists4<-function(countsDf, manualThresholds=NULL, extremaFunc=NULL){
    countsDf[countsDf==0]<-1 #pseudocounts
    rge<-range(log2(countsDf))
    xPositions = seq(from = 0, to = ceiling(rge[2])+1, by = 1)
    xLabels = 2^(xPositions)
    for (i in 1:(length(colnames(countsDf)))) {
        log2CurrCounts<-log2(countsDf[,i])
        log2CurrCountsHist<-hist(log2CurrCounts,
            breaks=seq(0-0.05,rge[2]+0.05,by=0.05),
            main=colnames(countsDf)[i],
            col=gHistogramColor,
            border=FALSE,
            xaxt = 'n',
            xlab="")
        axis(side = 1, at = xPositions, labels=xLabels, las=2)
        mtext("counts (pseudocount added to zeros only)", side=1, line=3)
        # density curve
        log2CurrCountsDensity<-density(log2CurrCounts)
        scaleFactor = sum(log2CurrCountsHist$counts)*0.05
        lines(log2CurrCountsDensity$x, log2CurrCountsDensity$y*scaleFactor, col=gDensityColor)
        # smoothing spline curve of non-zero freqs only
        tempDf = data.frame(log2CurrCountsHist$mids, log2CurrCountsHist$count)
        colnames(tempDf) = c("x", "y")
        nonZeroLog2CurrCountsHist = tempDf[which(tempDf$y>0), ]
        ss = smooth.spline(nonZeroLog2CurrCountsHist$x, nonZeroLog2CurrCountsHist$y, spar=gSpar)
        lines(ss, col=gSplineColor)
        # rug plots of manual thresholds
        plotExplicitThresholds(manualThresholds, i)
        # rug plots determined by external function
        if (!is.null(extremaFunc)){
            extremaFunc(ss)
        }
    }
}
drawCountsDists4(goodCounts, goodCountsManualThresholds, findAndPlotAllAndChosenMins)
drawCountsDists4(poorCounts, poorCountsManualThresholds, findAndPlotAllAndChosenMins)
Ok! The "rightmost minimum" approach seems to work pretty well. However, it doesn't work for the three cases called out above. Additionally, it is not capturing quite what I want for u2os_trex2a_t21_S4, where I really want the second-to-rightmost minimum because it is a smaller minimum. But I don't always want the smallest local interior minimum because, for example, I don't want any of the basically-zero local minima out at the far right end of the count distribution ... I only want the smallest local interior minimum that lies to the left of the "real" peak.
findExtremaIndices<-function(objWithXandY, getMins=TRUE){
    relevantDiff = if(getMins==TRUE) 2 else -2
    indicesOfExtrema = which(diff(sign(diff(objWithXandY$y)))==relevantDiff)+1
    return(indicesOfExtrema)
}
getlog2CountsAndFreqsAtExtrema<-function(densityObj, indicesOfExtrema){
    log2CountsAtExtrema = densityObj$x[indicesOfExtrema]
    densityFunc = approxfun(densityObj)
    freqsAtExtrema = densityFunc(log2CountsAtExtrema)
    result = data.frame(log2CountsAtExtrema, freqsAtExtrema)
    result = result[with(result, order(log2CountsAtExtrema)), ]
    return(result)
}
findSmallestMinLeftOfMax1<-function(splineWithXandY){
    minCountLimit = 0 #log2(0) is -Inf, so for now there is effectively no lower limit
    minLog2CountThreshold = log2(minCountLimit)
    result = NULL # assume failure
    #look for row indices of local interior maxima and local interior minima in input spline curve
    indicesOfMaxes = findExtremaIndices(splineWithXandY, FALSE)
    indicesOfMins = findExtremaIndices(splineWithXandY, TRUE)
    # give up if there isn't at least one of each; otherwise
    if (length(indicesOfMaxes)>0 & length(indicesOfMins)>0){
        # get x and y values in the rows representing the local interior maxima
        xAndYatMaxesDf = getlog2CountsAndFreqsAtExtrema(splineWithXandY, indicesOfMaxes)
        eligibleMaxesDf = xAndYatMaxesDf[which(
            xAndYatMaxesDf$log2CountsAtExtrema >= minLog2CountThreshold), ]
        # if there are no local interior maxima at x values gte the global minimum allowed, give up; otherwise
        if (nrow(eligibleMaxesDf)>0){
            # pick the x position of the eligible local interior maximum with the largest y value
            chosenMaxDf = eligibleMaxesDf[which(
                eligibleMaxesDf$freqsAtExtrema == max(eligibleMaxesDf$freqsAtExtrema)), ]
            rightmostLog2Count = chosenMaxDf$log2CountsAtExtrema[1]
            # get x and y values in the rows representing the local interior minima
            xAndYatMinsDf = getlog2CountsAndFreqsAtExtrema(splineWithXandY, indicesOfMins)
            eligibleMinsDf = xAndYatMinsDf[which(
                xAndYatMinsDf$log2CountsAtExtrema >= minLog2CountThreshold &
                xAndYatMinsDf$log2CountsAtExtrema < rightmostLog2Count), ]
            # if there are no local interior minima with x values gte the global minimum allowed and
            # lt the x position of the chosen maximum, give up; otherwise
            if (nrow(eligibleMinsDf)>0){
                # pick the x position of the eligible local interior minimum with the smallest y value
                chosenMinDf = eligibleMinsDf[which(
                    eligibleMinsDf$freqsAtExtrema == min(eligibleMinsDf$freqsAtExtrema)), ]
                result = chosenMinDf$log2CountsAtExtrema
            }
        }
    }
    return(result)
}
drawCountsDists5<-function(countsDf, manualThresholds=NULL, extremaFunc=NULL){
    countsDf[countsDf==0]<-1 #pseudocounts
    rge<-range(log2(countsDf))
    xPositions = seq(from = 0, to = ceiling(rge[2])+1, by = 1)
    xLabels = 2^(xPositions)
    for (i in 1:(length(colnames(countsDf)))) {
        log2CurrCounts<-log2(countsDf[,i])
        log2CurrCountsHist<-hist(log2CurrCounts,
            breaks=seq(0-0.05,rge[2]+0.05,by=0.05),
            main=colnames(countsDf)[i],
            col=gHistogramColor,
            border=FALSE,
            xaxt = 'n',
            xlab="")
        axis(side = 1, at = xPositions, labels=xLabels, las=2)
        mtext("counts (pseudocount added to zeros only)", side=1, line=3)
        # density curve
        log2CurrCountsDensity<-density(log2CurrCounts)
        scaleFactor = sum(log2CurrCountsHist$counts)*0.05
        lines(log2CurrCountsDensity$x, log2CurrCountsDensity$y*scaleFactor, col=gDensityColor)
        # smoothing spline curve of non-zero freqs only
        tempDf = data.frame(log2CurrCountsHist$mids, log2CurrCountsHist$count)
        colnames(tempDf) = c("x", "y")
        nonZeroLog2CurrCountsHist = tempDf[which(tempDf$y>0), ]
        ss = smooth.spline(nonZeroLog2CurrCountsHist$x, nonZeroLog2CurrCountsHist$y, spar=gSpar)
        lines(ss, col=gSplineColor)
        # rug plots of manual thresholds
        plotExplicitThresholds(manualThresholds, i)
        # rug plots determined by external function
        if (!is.null(extremaFunc)){
            putativeThreshold = extremaFunc(ss)
            plotExplicitThresholds(putativeThreshold, 1, areLog2=TRUE, color=gChosenColor)
        }
    }
}
drawCountsDists5(goodCounts, goodCountsManualThresholds, findSmallestMinLeftOfMax1)
drawCountsDists5(poorCounts, poorCountsManualThresholds, findSmallestMinLeftOfMax1)
Not bad. This logic fixes the complaint I had about u2os_trex2a_t21_S4.
Cases that still look bad to me are Hela.CV4R4R_d3_1_S7 (chosen threshold is too low), U2OS.CAS9TREX2B.t14_S2 (no threshold is chosen), and U2OS.CAS9TREX2A.t28_S5 (chosen threshold is too low). Well, much as I dislike it, I think I may need to introduce a global minimum allowable threshold. Note that this minimum will apply to ALL samples, and I think can be reasonably set ahead of time, so it will still be preferable to the per-sample abundance threshold whose selection I'm trying to automate here.
Let's try a global minimum of 10 counts ... 10 counts seems to me like a plausible minimum for the abundance threshold:
findSmallestMinLeftOfMax2<-function(splineWithXandY){
    minCountLimit = 10
    minLog2CountThreshold = log2(minCountLimit)
    result = NULL # assume failure
    #look for row indices of local interior maxima and local interior minima in input spline curve
    indicesOfMaxes = findExtremaIndices(splineWithXandY, FALSE)
    indicesOfMins = findExtremaIndices(splineWithXandY, TRUE)
    # give up if there isn't at least one of each; otherwise
    if (length(indicesOfMaxes)>0 & length(indicesOfMins)>0){
        # get x and y values in the rows representing the local interior maxima
        xAndYatMaxesDf = getlog2CountsAndFreqsAtExtrema(splineWithXandY, indicesOfMaxes)
        eligibleMaxesDf = xAndYatMaxesDf[which(
            xAndYatMaxesDf$log2CountsAtExtrema >= minLog2CountThreshold), ]
        # if there are no local interior maxima at x values gte the global minimum allowed, give up; otherwise
        if (nrow(eligibleMaxesDf)>0){
            # pick the x position of the eligible local interior maximum with the largest y value
            chosenMaxDf = eligibleMaxesDf[which(
                eligibleMaxesDf$freqsAtExtrema == max(eligibleMaxesDf$freqsAtExtrema)), ]
            rightmostLog2Count = chosenMaxDf$log2CountsAtExtrema[1]
            # get x and y values in the rows representing the local interior minima
            xAndYatMinsDf = getlog2CountsAndFreqsAtExtrema(splineWithXandY, indicesOfMins)
            eligibleMinsDf = xAndYatMinsDf[which(
                xAndYatMinsDf$log2CountsAtExtrema >= minLog2CountThreshold &
                xAndYatMinsDf$log2CountsAtExtrema < rightmostLog2Count), ]
            # if there are no local interior minima with x values gte the global minimum allowed and
            # lt the x position of the chosen maximum, give up; otherwise
            if (nrow(eligibleMinsDf)>0){
                # pick the x position of the eligible local interior minimum with the smallest y value
                chosenMinDf = eligibleMinsDf[which(
                    eligibleMinsDf$freqsAtExtrema == min(eligibleMinsDf$freqsAtExtrema)), ]
                result = chosenMinDf$log2CountsAtExtrema
            }
        }
    }
    return(result)
}
drawCountsDists5(goodCounts, goodCountsManualThresholds, findSmallestMinLeftOfMax2)
drawCountsDists5(poorCounts, poorCountsManualThresholds, findSmallestMinLeftOfMax2)
I'm going to cautiously call that an improvement: I am now getting no selected threshold (which is preferable to a garbage one) for Hela.CV4R4R_d3_1_S7 and U2OS.CAS9TREX2B.t14_S2.
Unfortunately, U2OS.CAS9TREX2A.t28_S5 has gone from picking too low a threshold to picking one that is too high ... I don't really believe the local max to the right of that picked threshold is the "real" distribution, particularly because that threshold appears to be cutting out basically all of the data. While it's true the "real" distribution could be very small (as in U2OS.CAS9TREX2B.t28_S6), I don't believe any threshold pick under which the "real" distribution is actually imperceptible ... and even if that were a realistic threshold pick, the data would probably be unusable. I'm going to bite the heuristic bullet and introduce another arbitrary global limit, this time on the fraction of the counts that an acceptable threshold can exclude. Let's try 0.8 (that is, refuse any threshold that would exclude 80% or more of the counts):
extremumIsAcceptable<-function(putativeThreshold, hist, maxCountFractionExcluded){
    result = FALSE
    if (!is.null(putativeThreshold)) {
        fractionCountsExcluded = getFractionCountsExcluded(hist, putativeThreshold)
        result = fractionCountsExcluded < maxCountFractionExcluded
    }
    return(result)
}
getFractionCountsExcluded<-function(hist, putativeThreshold){
    tempHistDf = data.frame(mids=hist$mids, counts=hist$counts)
    eligibleHistDf = tempHistDf[which(tempHistDf$mids < putativeThreshold), ]
    result = sum(eligibleHistDf$counts)/sum(hist$counts)
    return(result)
}
findSmallestMinLeftOfMax3<-function(splineWithXandY, hist){
    minLog2CountThreshold = log2(10)
    maxCountFractionExcluded = 0.8
    result = NULL # assume failure
    #look for row indices of local interior maxima and local interior minima in input spline curve
    indicesOfMaxes = findExtremaIndices(splineWithXandY, FALSE)
    indicesOfMins = findExtremaIndices(splineWithXandY, TRUE)
    # give up if there isn't at least one of each; otherwise
    if (length(indicesOfMaxes)>0 & length(indicesOfMins)>0){
        # get x and y values in the rows representing the local interior maxima
        xAndYatMaxesDf = getlog2CountsAndFreqsAtExtrema(splineWithXandY, indicesOfMaxes)
        eligibleMaxesDf = xAndYatMaxesDf[which(
            xAndYatMaxesDf$log2CountsAtExtrema >= minLog2CountThreshold), ]
        # if there are no local interior maxima at x values gte the global minimum allowed, give up; otherwise
        if (nrow(eligibleMaxesDf)>0){
            # pick the x position of the eligible local interior maximum with the largest y value
            chosenMaxDf = eligibleMaxesDf[which(
                eligibleMaxesDf$freqsAtExtrema == max(eligibleMaxesDf$freqsAtExtrema)), ]
            rightmostLog2Count = chosenMaxDf$log2CountsAtExtrema[1]
            # get x and y values in the rows representing the local interior minima
            xAndYatMinsDf = getlog2CountsAndFreqsAtExtrema(splineWithXandY, indicesOfMins)
            eligibleMinsDf = xAndYatMinsDf[which(
                xAndYatMinsDf$log2CountsAtExtrema >= minLog2CountThreshold &
                xAndYatMinsDf$log2CountsAtExtrema < rightmostLog2Count), ]
            # if there are no local interior minima with x values gte the global minimum allowed and
            # lt the x position of the chosen maximum, give up; otherwise
            if (nrow(eligibleMinsDf)>0){
                # pick the x position of the eligible local interior minimum with the smallest y value
                chosenMinDf = eligibleMinsDf[which(
                    eligibleMinsDf$freqsAtExtrema == min(eligibleMinsDf$freqsAtExtrema)), ]
                putativeResult = chosenMinDf$log2CountsAtExtrema
                # Only known situation where above logic picks a bad threshold is when all "real"
                # data is monotonically decreasing but there is (at least one) minute local maximum
                # in the noise at far right of the count distribution; extremumIsAcceptable sanity-checks
                # for that pathological case.
                if (extremumIsAcceptable(putativeResult, hist, maxCountFractionExcluded)){
                    result = putativeResult
                }
            }
        }
    }
    return(result)
}
drawCountsDists6<-function(countsDf, manualThresholds=NULL, extremaFunc=NULL){
    countsDf[countsDf==0]<-1 #pseudocounts
    rge<-range(log2(countsDf))
    xPositions = seq(from = 0, to = ceiling(rge[2])+1, by = 1)
    xLabels = 2^(xPositions)
    for (i in 1:(length(colnames(countsDf)))) {
        log2CurrCounts<-log2(countsDf[,i])
        log2CurrCountsHist<-hist(log2CurrCounts,
            breaks=seq(0-0.05,rge[2]+0.05,by=0.05),
            main=colnames(countsDf)[i],
            col=gHistogramColor,
            border=FALSE,
            xaxt = 'n',
            xlab="")
        axis(side = 1, at = xPositions, labels=xLabels, las=2)
        mtext("counts (pseudocount added to zeros only)", side=1, line=3)
        # density curve
        log2CurrCountsDensity<-density(log2CurrCounts)
        scaleFactor = sum(log2CurrCountsHist$counts)*0.05
        lines(log2CurrCountsDensity$x, log2CurrCountsDensity$y*scaleFactor, col=gDensityColor)
        # smoothing spline curve of non-zero freqs only
        tempDf = data.frame(log2CurrCountsHist$mids, log2CurrCountsHist$count)
        colnames(tempDf) = c("x", "y")
        nonZeroLog2CurrCountsHist = tempDf[which(tempDf$y>0), ]
        ss = smooth.spline(nonZeroLog2CurrCountsHist$x, nonZeroLog2CurrCountsHist$y, spar=gSpar)
        lines(ss, col=gSplineColor)
        # rug plots of manual thresholds
        plotExplicitThresholds(manualThresholds, i)
        # rug plots determined by external function
        if (!is.null(extremaFunc)){
            putativeThreshold = extremaFunc(ss, log2CurrCountsHist)
            plotExplicitThresholds(putativeThreshold, 1, areLog2=TRUE, color=gChosenColor)
        }
    }
}
drawCountsDists6(goodCounts, goodCountsManualThresholds, findSmallestMinLeftOfMax3)
drawCountsDists6(poorCounts, poorCountsManualThresholds, findSmallestMinLeftOfMax3)
Ok, now none of the three problem cases get garbage threshold picks ... yay, I think?
However, for all three of these problem cases we were able to assign manual picks, so if it wasn't by looking for a minimum in the valley, how did we do it?
In those cases, I think that I was attempting to intuit the point at which a Poisson-like (or NB-like, or whatever) distribution of noisy counts gave way to a Gaussian-like distribution of "real" counts. Unfortunately (work not shown--thank me for that), it turns out that R packages for fitting mixture models of two different kinds of distributions to a set of data are both rare and hard to work with. However, maybe I don't actually have to ... The density function's default kernel is Gaussian, and it seems good at fitting the Gaussian-ish parts of the test cases but bad at fitting the Poisson-ish parts. The spline, on the other hand, fits both pretty well. So what? So, when the spline and the density function converge, we're past the noisy data and into the "real" data! Hey, presto :)
Well, ok, I don't quite want the point where they converge ... looking at the manual thresholds for the outstanding test cases, it seems I want the point where they get "near enough" to each other. Argh, another arbitrary choice: how near is near enough? The absolute value of "near enough" clearly differs for data with different frequency distributions, but a fixed fraction of the difference between the maximum spline value and the minimum density value seems to track my manual picks pretty well. Let's try implementing that with a threshold of 0.05 (5%) ... and I think I'll keep the minimum and maximum limits that I already wrote for the smallest-min-in-the-valley approach here, too.
makeSplineAndDensityDf<-function(scaledDensityXandY, splineXandY){
    # Determine spline and (scaled) density y values at shared set of x values
    # where neither is NA, then calculate difference between spline and density
    # at each of those points.
    splineFunc = approxfun(splineXandY)
    splineYAtDensityX = splineFunc(scaledDensityXandY$x)
    result = data.frame(x=scaledDensityXandY$x, splineY=splineYAtDensityX,
        densityY=scaledDensityXandY$y)
    result = na.omit(result)
    result$y = result$splineY - result$densityY
    return(result)
}
getNearnessThreshold<-function(splineAndDensityDf, maxSplineDensityDiff){
    # Get global maximum value of spline function at any x
    # Get global minimum of scaled density function at any x
    # NB: x for max and min need not (and usually won't) be the same
    # Use these values to find the maximum difference between spline
    # and scaled density y values (regardless of x), then define
    # "near" to be when spline and scaled density y values for same x get within
    # the specified arbitrary fraction of that difference.
    maxSplineY = max(splineAndDensityDf$splineY)
    minDensityY = min(splineAndDensityDf$densityY)
    maxDiff = maxSplineY - minDensityY
    result = maxDiff * maxSplineDensityDiff
    return(result)
}
findSplineAndDensityNearPoint1<-function(scaledDensityXandY, splineXandY, hist){
maxSplineDensityDiff = 0.05
log2minCountLimit = log2(10)
maxCountFractionExcluded = 0.8
result = NULL # assume failure
splineAndDensityDf = makeSplineAndDensityDf(scaledDensityXandY, splineXandY)
nearnessThreshold = getNearnessThreshold(splineAndDensityDf, maxSplineDensityDiff)
# if there are no records whose x positions are gte the global minimum allowed,
# give up; otherwise
eligibleSplineAndDensityDf = splineAndDensityDf[which(
splineAndDensityDf$x >= log2minCountLimit), ]
if (nrow(eligibleSplineAndDensityDf)>0){
# Walk through all eligible x positions, from smallest toward largest.
# Assuming you don't get lucky and just find an x value right on the threshold,
# find the pair of x positions (if any such exist) that bracket the
# spot where the spline and density curves get "near enough" to each other.
# Return the point half-way between these two x positions (note that this is
# obviously a punt--I *could* do numerical approximation to find it, or
# set up a function that reached zero when the spline and density were
# "near enough" and then optimize it, but frankly it just doesn't seem
# worth the trouble ...)
putativeResult = NULL
largestXgtThresh = NULL
smallestXltThresh = NULL
for (i in 1:nrow(eligibleSplineAndDensityDf)){
currYval = eligibleSplineAndDensityDf$y[i]
currXval = eligibleSplineAndDensityDf$x[i]
if (currYval == nearnessThreshold) {
putativeResult = currXval
break
} else if (currYval <= nearnessThreshold) {
smallestXltThresh = currXval
if (is.null(largestXgtThresh)) {
putativeResult = smallestXltThresh
} else {
putativeResult = (smallestXltThresh - largestXgtThresh)/2 + largestXgtThresh
}
}
break
} else { # (currYval > nearnessThreshold)
largestXgtThresh = currXval
}
}
if (extremumIsAcceptable(putativeResult, hist, maxCountFractionExcluded)){
result = putativeResult
}
}
return(result)
}
drawCountsDists7<-function(countsDf, manualThresholds=NULL, extremaFunc=NULL){
countsDf[countsDf==0]<-1 #pseudocounts
rge<-range(log2(countsDf))
xPositions = seq(from = 0, to = ceiling(rge[2])+1, by = 1)
xLabels = 2^(xPositions)
for (i in 1:(length(colnames(countsDf)))) {
log2CurrCounts<-log2(countsDf[,i])
log2CurrCountsHist<-hist(log2CurrCounts,
breaks=seq(0-0.05,rge[2]+0.05,by=0.05),
main=colnames(countsDf)[i],
col=gHistogramColor,
border=FALSE,
xaxt = 'n',
xlab="")
axis(side = 1, at = xPositions, labels=xLabels, las=2)
mtext("counts (pseudocount added to zeros only)", side=1, line=3)
# density curve
scaleFactor = sum(log2CurrCountsHist$counts)*0.05
log2CurrCountsDensity<-density(log2CurrCounts)
scaledLog2CurrCountsDensityDf = data.frame(x=log2CurrCountsDensity$x,
y=log2CurrCountsDensity$y*scaleFactor)
lines(scaledLog2CurrCountsDensityDf,col=gDensityColor)
# smoothing spline curve of non-zero freqs only
tempDf = data.frame(log2CurrCountsHist$mids, log2CurrCountsHist$counts)
colnames(tempDf) = c("x", "y")
nonZeroLog2CurrCountsHist = tempDf[which(tempDf$y>0), ]
ss = smooth.spline(nonZeroLog2CurrCountsHist$x, nonZeroLog2CurrCountsHist$y, spar=gSpar)
lines(ss, col=gSplineColor)
# rug plots of manual thresholds
plotExplicitThresholds(manualThresholds, i)
# rug plots determined by external function
if (!is.null(extremaFunc)){
putativeThreshold = extremaFunc(scaledLog2CurrCountsDensityDf, ss, log2CurrCountsHist)
plotExplicitThresholds(putativeThreshold, 1, areLog2=TRUE, color=gChosenColor)
}
}
}
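Before moving on, it's worth spelling out why scaleFactor is the total count times the bin width: a kernel density integrates to 1, so multiplying it by (number of observations x bin width) puts it on the same y scale as the counts returned by hist(). A standalone sketch with made-up data (the seed is arbitrary, for reproducibility of the sketch only):

```r
# Standalone sketch (made-up data, not screen data) of the density scaling:
# density integrates to 1, so density * N * binwidth approximates bin counts.
set.seed(42)  # arbitrary seed for this sketch only
vals <- rnorm(1000)
h <- hist(vals, breaks = seq(-5, 5, by = 0.5), plot = FALSE)
d <- density(vals)
scaleFactor <- sum(h$counts) * 0.5  # N x bin width
# after scaling, the density peak is comparable to the tallest bin count
max(h$counts)
max(d$y * scaleFactor)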
drawCountsDists7(goodCounts, goodCountsManualThresholds, findSplineAndDensityNearPoint1)
drawCountsDists7(poorCounts, poorCountsManualThresholds, findSplineAndDensityNearPoint1)
Ooo, shiny! OK, so this approach isn't as good as the smallest-min-in-the-valley approach on the test cases where that approach actually finds something, but for the three cases where that one fails, this approach picks thresholds that make sense. Let's combine them and call it good to go!
gMinCountLimit = 10 #Note: in absolute counts, not log2
gMaxFractionAcceptableSplineDensityDiff = 0.05 # fraction of diff between max spline and min density
gMaxFractionCountsExcluded = 0.8 # any threshold throwing out >x% of counts is not acceptable
extremumIsAcceptable<-function(putativeThreshold, hist, maxCountFractionExcluded){
result = FALSE
if (!is.null(putativeThreshold)) {
fractionCountsExcluded = getFractionCountsExcluded(hist, putativeThreshold,
maxCountFractionExcluded)
result = fractionCountsExcluded<maxCountFractionExcluded
}
return(result)
}
getFractionCountsExcluded<-function(hist, putativeThreshold, maxCountFractionExcluded){
# NB: maxCountFractionExcluded is unused here--the comparison against it happens in
# extremumIsAcceptable--but the parameter is kept so existing call sites don't break
tempHistDf = data.frame(mids=hist$mids, counts=hist$counts)
eligibleHistDf = tempHistDf[which(tempHistDf$mids<putativeThreshold), ]
result = sum(eligibleHistDf$counts)/sum(hist$counts)
return(result)
}
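As a quick standalone check of the fraction-excluded computation (toy numbers, not real screen data): with bin mids 1 through 5 and counts 10, 20, 30, 20, 20, a threshold of 2.5 excludes the first two bins.

```r
# Toy hist-like object (made-up numbers) mirroring the fields used above
toyHist <- list(mids = 1:5, counts = c(10, 20, 30, 20, 20))
putativeThreshold <- 2.5
# same logic as getFractionCountsExcluded: sum counts in bins left of threshold
excluded <- sum(toyHist$counts[toyHist$mids < putativeThreshold]) / sum(toyHist$counts)
excluded  # 0.3, i.e. 30% of counts fall below the threshold
```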
findExtremaIndices<-function(objWithXandY, getMins=TRUE){
# sign(diff(y)) flips from -1 to +1 at an interior minimum (second diff of +2)
# and from +1 to -1 at an interior maximum (second diff of -2)
relevantDiff = if (getMins) 2 else -2
indicesOfExtrema = which(diff(sign(diff(objWithXandY$y)))==relevantDiff)+1
return(indicesOfExtrema)
}
}
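The diff(sign(diff(y))) idiom deserves a worked example (standalone toy vector, not screen data): the +1 at the end compensates for the index shift introduced by the inner diff.

```r
# Toy y vector with an interior peak at index 3 and an interior trough at index 5
y <- c(1, 3, 5, 2, 1, 4, 6)
diff(y)                               # 2  2 -3 -1  3  2
sign(diff(y))                         # 1  1 -1 -1  1  1
diff(sign(diff(y)))                   # 0 -2  0  2  0
which(diff(sign(diff(y))) == -2) + 1  # 3 (interior maximum)
which(diff(sign(diff(y))) == 2) + 1   # 5 (interior minimum)
```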
getlog2CountsAndFreqsAtExtrema<-function(densityObj, indicesOfExtrema){
log2CountsAtExtrema = densityObj$x[indicesOfExtrema]
densityFunc = approxfun(densityObj)
freqsAtExtrema = densityFunc(log2CountsAtExtrema)
result = data.frame(log2CountsAtExtrema, freqsAtExtrema)
result = result[with(result, order(log2CountsAtExtrema)), ]
return(result)
}
# general concept: identify the peak of the "main distribution", then look for the lowest point in the
# "valley" between it and the noise spike at the low end of the counts histogram.
# Not all count distributions have this general shape; for all known cases that don't, this method
# will return NULL (rather than just silently picking a bad threshold).
findSmallestMinLeftOfMax<-function(splineWithXandY, minCountLimit, hist, maxCountFractionExcluded){
minLog2CountThreshold = log2(minCountLimit)
result = NULL # assume failure
#look for row indices of local interior maxima and local interior minima in input spline curve
indicesOfMaxes = findExtremaIndices(splineWithXandY, FALSE)
indicesOfMins = findExtremaIndices(splineWithXandY, TRUE)
# give up if there aren't at least one of each; otherwise
if (length(indicesOfMaxes)>0 && length(indicesOfMins)>0){
# get x and y values in the rows representing the local interior maxima
xAndYatMaxesDf = getlog2CountsAndFreqsAtExtrema(splineWithXandY, indicesOfMaxes)
eligibleMaxesDf = xAndYatMaxesDf[which(
xAndYatMaxesDf$log2CountsAtExtrema >= minLog2CountThreshold), ]
# if there are no local interior maxima at x values gte the global minimum allowed, give up; otherwise
if (nrow(eligibleMaxesDf)>0){
# pick the x position of the eligible local interior maximum with the largest y value
chosenMaxDf = eligibleMaxesDf[which(
eligibleMaxesDf$freqsAtExtrema == max(eligibleMaxesDf$freqsAtExtrema)), ]
rightmostLog2Count = chosenMaxDf$log2CountsAtExtrema[1]
# get x and y values in the rows representing the local interior minima
xAndYatMinsDf = getlog2CountsAndFreqsAtExtrema(splineWithXandY, indicesOfMins)
eligibleMinsDf = xAndYatMinsDf[which(
xAndYatMinsDf$log2CountsAtExtrema >= minLog2CountThreshold &
xAndYatMinsDf$log2CountsAtExtrema < rightmostLog2Count), ]
# if there are no local interior minima with x values gte the global minimum allowed and
# lt the x position of the chosen maximum, give up; otherwise
if (nrow(eligibleMinsDf)>0){
# pick the x position of the eligible local interior minimum with the smallest y value
chosenMinDf = eligibleMinsDf[which(
eligibleMinsDf$freqsAtExtrema == min(eligibleMinsDf$freqsAtExtrema)), ]
putativeResult = chosenMinDf$log2CountsAtExtrema[1]
# Only known situation where above logic picks a bad threshold is when all "real"
# data is monotonically decreasing but there is (at least one) minute local maximum
# in the noise at far right of the count distribution; extremumIsAcceptable sanity-checks
# for that pathological case.
if (extremumIsAcceptable(putativeResult, hist, maxCountFractionExcluded)){
result = putativeResult
}
}
}
}
return(result)
}
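The core "smallest min left of the max" idea can be exercised on a made-up y vector. This is a simplified standalone toy that works on raw y values rather than a spline object, and it skips the minimum-count and fraction-excluded guards:

```r
# Made-up shape: noise spike at left, small bump, valley, main peak
y <- c(50, 20, 5, 8, 3, 10, 40, 80, 60, 30)
mins <- which(diff(sign(diff(y))) == 2) + 1   # interior minima: indices 3, 5
maxs <- which(diff(sign(diff(y))) == -2) + 1  # interior maxima: indices 4, 8
mainPeak <- maxs[which.max(y[maxs])]          # index 8, the tallest maximum
valleyMins <- mins[mins < mainPeak]           # minima left of the main peak
threshold <- valleyMins[which.min(y[valleyMins])]
threshold  # 5: the deepest valley point left of the main peak
```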
# helper for findSplineAndDensityNearPoint
makeSplineAndDensityDf<-function(scaledDensityXandY, splineXandY){
# Determine spline and (scaled) density y values at shared set of x values
# where neither is NA, then calculate difference between spline and density
# at each of those points.
splineFunc = approxfun(splineXandY)
splineYAtDensityX = splineFunc(scaledDensityXandY$x)
result = data.frame(x=scaledDensityXandY$x, splineY=splineYAtDensityX,
densityY=scaledDensityXandY$y)
result = na.omit(result)
result$y = result$splineY-result$densityY
return(result)
}
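approxfun is doing the alignment work here: it returns a linear-interpolation function for one curve that can then be evaluated at the other curve's x positions. A minimal standalone sketch with made-up points:

```r
# A toy "spline" curve as a list with x and y components
splineXandY <- list(x = c(0, 2, 4), y = c(0, 10, 20))
splineFunc <- approxfun(splineXandY)  # linear interpolator
# evaluate the curve at x positions it was not originally defined at
splineFunc(c(1, 3))  # 5 15
```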
# helper for findSplineAndDensityNearPoint
getNearnessThreshold<-function(splineAndDensityDf, maxSplineDensityDiff){
# Get global maximum value of spline function at any x
# Get global minimum of scaled density function at any x
# NB: x for max and min need not (and usually won't) be the same
# Use these values to find the maximum difference between spline
# and scaled density y values (regardless of x), then define
# "near" to be when spline and scaled density y values for same x get within
# the specified arbitrary fraction of that difference.
maxSplineY = max(splineAndDensityDf$splineY)
minDensityY = min(splineAndDensityDf$densityY)
maxDiff = maxSplineY - minDensityY
result = maxDiff * maxSplineDensityDiff
return(result)
}
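With made-up numbers, the arithmetic looks like this: if the spline's global maximum is 500 and the scaled density's global minimum is 20, maxDiff is 480, and at the 5% setting the curves count as "near" once their gap at a shared x drops below 24.

```r
# Made-up curve extremes, not real screen values
maxSplineY <- 500
minDensityY <- 20
maxSplineDensityDiff <- 0.05  # the 5% setting used above
nearnessThreshold <- (maxSplineY - minDensityY) * maxSplineDensityDiff
nearnessThreshold  # 24
```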
# general concept: find the leftmost point (greater than the global minimum allowed)
# in the count distribution where the scaled density curve and the spline curve are
# within the global arbitrary threshold of one another.
# This gives worse results than findSmallestMinLeftOfMax on "good" count distributions,
# so it isn't the first-choice approach, but it makes a good fall-back for count
# distributions (especially noisy or low-signal ones) where findSmallestMinLeftOfMax
# fails to find a threshold. Fails to find a threshold ONLY in cases where
# spline and density curve never get "near" each other over range of
# counts in the underlying count distribution.
findSplineAndDensityNearPoint<-function(scaledDensityXandY, splineXandY, minCountLimit,
maxFractionAcceptableSplineDensityDiff, hist, maxCountFractionExcluded){
log2minCountLimit = log2(minCountLimit)
maxSplineDensityDiff = maxFractionAcceptableSplineDensityDiff
result = NULL # assume failure
splineAndDensityDf = makeSplineAndDensityDf(scaledDensityXandY, splineXandY)
nearnessThreshold = getNearnessThreshold(splineAndDensityDf, maxSplineDensityDiff)
# if there are no records whose x positions are gte the global minimum allowed,
# give up; otherwise
eligibleSplineAndDensityDf = splineAndDensityDf[which(
splineAndDensityDf$x >= log2minCountLimit), ]
if (nrow(eligibleSplineAndDensityDf)>0){
# Walk through all eligible x positions, from smallest toward largest.
# Assuming you don't get lucky and just find an x value right on the threshold,
# find the pair of x positions (if any such exist) that bracket the
# spot where the spline and density curves get "near enough" to each other.
# Return the point half-way between these two x positions (note that this is
# obviously a punt--I *could* do numerical approximation to find it, or
# set up a function that reached zero when the spline and density were
# "near enough" and then optimize it, but frankly it just doesn't seem
# worth the trouble ...)
putativeResult = NULL
largestXgtThresh = NULL
smallestXltThresh = NULL
for (i in 1:nrow(eligibleSplineAndDensityDf)){
currYval = eligibleSplineAndDensityDf$y[i]
currXval = eligibleSplineAndDensityDf$x[i]
if (currYval == nearnessThreshold) {
putativeResult = currXval
break
} else if (currYval <= nearnessThreshold) {
smallestXltThresh = currXval
if (is.null(largestXgtThresh)) {
putativeResult = smallestXltThresh
} else {
putativeResult = (smallestXltThresh - largestXgtThresh)/2 + largestXgtThresh
}
}
break
} else { # (currYval > nearnessThreshold)
largestXgtThresh = currXval
}
}
if (extremumIsAcceptable(putativeResult, hist, maxCountFractionExcluded)){
result = putativeResult
}
}
return(result)
}
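The bracketing-and-midpoint walk can be shown with toy numbers (standalone sketch; in the real function the y values are the spline-minus-density gaps): the gap first dips below the nearness threshold between two adjacent x positions, and the returned threshold is their midpoint.

```r
# Made-up gap values at increasing x positions
xs <- 4:8
ys <- c(100, 60, 30, 10, 5)  # spline-minus-density gap at each x
nearnessThreshold <- 24
firstBelow <- which(ys <= nearnessThreshold)[1]  # index 4, i.e. x = 7
# midpoint of the bracketing pair, as in the loop above
midpoint <- (xs[firstBelow] - xs[firstBelow - 1]) / 2 + xs[firstBelow - 1]
midpoint  # 6.5
```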
analyzeCountsDist<-function(log2countsDfForSingleSample, rangeObj, minCountLimit, maxCountFractionExcluded,
maxFractionAcceptableSplineDensityDiff){
resultSummary = "No acceptable extremum found." # assume failure
rge<-rangeObj
increment = 0.05
log2CurrCountsHist<-hist(log2countsDfForSingleSample,
breaks=seq(0-increment,rge[2]+increment,by=increment),
plot=FALSE)
# density curve
scaleFactor = sum(log2CurrCountsHist$counts)*increment
log2CurrCountsDensity<-density(log2countsDfForSingleSample)
scaledLog2CurrCountsDensityDf = data.frame(x=log2CurrCountsDensity$x,
y=log2CurrCountsDensity$y*scaleFactor)
# smoothing spline curve of non-zero freqs only
log2CurrCountsHistXandY = data.frame(x=log2CurrCountsHist$mids, y=log2CurrCountsHist$counts)
nonZeroLog2CurrCountsHistXandY = log2CurrCountsHistXandY[which(log2CurrCountsHistXandY$y>0), ]
log2CurrCountsSpline = smooth.spline(nonZeroLog2CurrCountsHistXandY$x, nonZeroLog2CurrCountsHistXandY$y)
# threshold selection
putativeThreshold = findSmallestMinLeftOfMax(log2CurrCountsSpline, minCountLimit,
log2CurrCountsHist, maxCountFractionExcluded)
if (!is.null(putativeThreshold)){
resultSummary = "Smallest-min-left-of-max method used."
} else {
putativeThreshold = findSplineAndDensityNearPoint(scaledLog2CurrCountsDensityDf, log2CurrCountsSpline,
minCountLimit, maxFractionAcceptableSplineDensityDiff, log2CurrCountsHist, maxCountFractionExcluded)
if (!is.null(putativeThreshold)){
resultSummary = "Leftmost-spline-and-density-within-x-percent method used."
}
}
result = list(threshold = putativeThreshold, resultSummary=resultSummary,
histogram=log2CurrCountsHist,
scaledDensity=scaledLog2CurrCountsDensityDf, spline=log2CurrCountsSpline)
return(result)
}
drawAnalyzedCountsDist<-function(sampleName, rangeObj, analysisResult, manualThreshold=NULL){
rge<-rangeObj
xPositions = seq(from = 0, to = ceiling(rge[2])+1, by = 1)
xLabels = 2^(xPositions)
titleText = paste0(sampleName,"\n", analysisResult$resultSummary)
hist = analysisResult$histogram
plot(hist,
col=gHistogramColor,
border=FALSE,
main=titleText,
xaxt = 'n',
xlab=""
)
axis(side = 1, at = xPositions, labels=xLabels, las=2)
mtext("counts (pseudocount added to zeros only)", side=1, line=3)
# density curve
lines(analysisResult$scaledDensity,col=gDensityColor)
# smoothing spline curve of non-zero freqs only
lines(analysisResult$spline, col=gSplineColor)
# rug plot of manual threshold, if any
if (!is.null(manualThreshold)){
rug(manualThreshold, col=gManualColor, lwd=3)
}
# vertical line of selected threshold, if any
analysisThreshold = analysisResult$threshold
if (!is.null(analysisThreshold)){
abline(v=analysisThreshold, col=gChosenColor)
fractionExcludedCounts = getFractionCountsExcluded(analysisResult$histogram,
analysisThreshold, gMaxFractionCountsExcluded)
percentExcludedCounts = fractionExcludedCounts*100
title(sub=paste0(format(round(percentExcludedCounts, 1), nsmall = 1), "% of counts excluded"))
}
}
analyzeAndDrawCountsDists<-function(multiSampleCountsDf, manualThresholds=NULL){
minCountLimit = gMinCountLimit
maxCountFractionExcluded = gMaxFractionCountsExcluded
maxFractionAcceptableSplineDensityDiff = gMaxFractionAcceptableSplineDensityDiff
multiSampleCountsDf[multiSampleCountsDf==0]<-1 #pseudocounts
log2MultiSampleCountsDf = log2(multiSampleCountsDf)
rangeObj = range(log2MultiSampleCountsDf)
for (i in 1:ncol(multiSampleCountsDf)) {
currSampleName = colnames(multiSampleCountsDf)[i]
log2countsDfForSingleSample = log2MultiSampleCountsDf[, i]
analysisResult = analyzeCountsDist(log2countsDfForSingleSample, rangeObj,
minCountLimit, maxCountFractionExcluded, maxFractionAcceptableSplineDensityDiff)
currManualThreshold = NULL
if (!is.null(manualThresholds)){
if (length(manualThresholds)>=i){
currManualThreshold = log2(manualThresholds[i])
}
}
drawAnalyzedCountsDist(currSampleName, rangeObj, analysisResult, currManualThreshold)
}
}
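One detail worth making explicit (standalone illustration): the pseudocount step maps zero-count constructs to log2(1) = 0 rather than -Inf, which is why the noise spike sits at the left edge of every histogram.

```r
# Toy count vector with a zero
counts <- c(0, 1, 8, 1024)
counts[counts == 0] <- 1  # pseudocount, as in the function above
log2(counts)  # 0 0 3 10; without the pseudocount the first value would be -Inf
```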
Now to run the combined version on the training data:
analyzeAndDrawCountsDists(goodCounts, goodCountsManualThresholds)
analyzeAndDrawCountsDists(poorCounts, poorCountsManualThresholds)
By George, I think she's got it!
Well, having a heuristic look good on its training data is nice, but that and $3.50 will get you a latte. How does this work on data it wasn't built to fit?
I'm going to take the good test data and mess with it to create some pathological cases, and see how those look. First, I'll multiply all the counts by 100 to make a much wider distribution:
inflatedCounts = goodCounts*100
head(inflatedCounts, 10)
analyzeAndDrawCountsDists(inflatedCounts)
Still looks sensible.
Now instead I'll add 100 counts to everything in the good test data, to see how the heuristic performs on a distribution that doesn't originate at 1:
addedCounts = goodCounts+100
head(addedCounts, 10)
analyzeAndDrawCountsDists(addedCounts)
Yep, looks good.
Now for the real test. I've gathered up the counts data for every dual-CRISPR screen we have processed to date; let's see how the heuristic works on all of these data. (Note that the original training data are in this complete set, so you'll see them in the results below.) I don't have manual threshold info for these, so my best-case scenario is that the heuristically chosen thresholds (or lack thereof) seem "sensible" to me when I examine them manually.
applyToAllCombinedCounts<-function(combinedCountsDirFp){
filePaths <- list.files(path=combinedCountsDirFp, pattern="_counts_combined\\.txt$", full.names=TRUE, recursive=FALSE)
for (i in 1:length(filePaths)){
currCounts = read.table(filePaths[i], header=TRUE, row.names=1)
analyzeAndDrawCountsDists(currCounts)
}
}
applyToAllCombinedCounts("~/Desktop/mali_combined_counts")
Hmm, not bad, not bad at all! Only two samples really trouble me: 1) UCBto_S2, for which the "near point" method was used to choose a threshold when I think it should have been the "smallest minimum in valley" method, and 2) LCTBt23_S5, where no minimum was found when I think one should have been. Tracing the reason for both of these (work not shown), I find that the thresholds I would manually choose throw out more than 80% of the data, thus falling afoul of the global limit I set above.
I therefore up the fraction-excluded limit to 95%. I also think I might like the "near point" of the spline/density curves to be a little nearer, so I tune that down to 2%. I can't guarantee these limits will be plausible for every data set, of course, but at least I can say they are reasonable choices for every dual-CRISPR data set I've ever seen :)
gMinCountLimit = 10 #Note: in absolute counts, not log2. Unchanged from earlier--10 seems to work well.
gMaxFractionAcceptableSplineDensityDiff = 0.02 # fraction of diff between max spline and min density
gMaxFractionCountsExcluded = 0.95 # any threshold throwing out >x% of counts is not acceptable
applyToAllCombinedCounts("~/Desktop/mali_combined_counts")
The revised global limits take care of the two I was concerned about. There's still one case (RuCTBt14_S7) where the heuristic makes a different pick than I would:

In this case, I would manually put the pick right in the "valley", but the heuristic doesn't, because the local maximum to the left of that valley is the highest one and lies above the global minimum-count limit. However, the "near point" threshold chosen instead by the heuristic is not a bad choice. I think I can live with this imperfection.
I am content with the developed heuristic's ability to pick reasonable thresholds on test data it has never seen before, and I believe it is ready to be implemented as part of the processing pipeline. However, because future data could have distributional properties that violate the assumptions of the heuristic (or require adjustment of one or more of the three global limits), I think that plots like those above should be part of the pipeline's output and should be examined by a human as a sanity check after the pipeline completes (similar to the way we currently examine the count boxplots). As long as they look good, the pipeline results can be used; if they look worrisome, the results can be thrown out and the pipeline rerun with adjusted parameters.